Self-healing in large-scale systems: parallel and distributed diagnostic architectures

Authors

  • Shang Guo
  • Irina Rish
  • David Loewenstern
Abstract

Automated real-time problem diagnosis is a key feature of a self-healing system. However, the rapidly growing size and complexity of modern distributed systems create a challenge for traditional centralized diagnostic approaches and call for parallel and distributed architectures. Dividing the system into subsystems controlled by separate diagnostic engines is an obvious choice; on top of that, however, a communication architecture must be provided that allows the diagnostic engines to exchange information about common components in order to obtain a better diagnosis. In this paper, we discuss a distributed belief propagation approach to diagnosis and provide a scalable parallel and distributed communication architecture that supports efficient message exchange among diagnostic engines.
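
As a rough illustration of the idea in the abstract, and not the authors' architecture, the following Python sketch shows two diagnostic engines that each hold local evidence about one shared component and exchange messages about it. Everything here is an assumption made for the example: the class `DiagnosticEngine`, the helper `exchange_messages`, the two-state model ("ok" / "faulty"), the numbers, and the simplification that each engine's evidence is conditionally independent given the component's state. Under that assumption the message each engine sends is simply its local likelihood, and every engine reaches the same posterior a centralized diagnoser would compute.

```python
from dataclasses import dataclass, field


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


@dataclass
class DiagnosticEngine:
    """Hypothetical engine monitoring one subsystem that contains the shared component."""
    name: str
    prior: dict             # prior P(state) of the shared component
    local_likelihood: dict  # P(this engine's observations | state)
    inbox: list = field(default_factory=list)

    def outgoing_message(self):
        # Message about the shared component: the local evidence summary.
        return dict(self.local_likelihood)

    def receive(self, message):
        self.inbox.append(message)

    def belief(self):
        # Posterior is proportional to prior * own evidence * all incoming messages.
        weights = {}
        for state, p in self.prior.items():
            w = p * self.local_likelihood[state]
            for message in self.inbox:
                w *= message[state]
            weights[state] = w
        return normalize(weights)


def exchange_messages(engines):
    """One round in which every engine sends its message to every other engine."""
    for sender in engines:
        for receiver in engines:
            if receiver is not sender:
                receiver.receive(sender.outgoing_message())


if __name__ == "__main__":
    prior = {"ok": 0.9, "faulty": 0.1}
    a = DiagnosticEngine("A", prior, {"ok": 0.2, "faulty": 0.8})
    b = DiagnosticEngine("B", prior, {"ok": 0.3, "faulty": 0.9})
    print("A alone:", a.belief())
    exchange_messages([a, b])
    print("A after exchange:", a.belief())
    print("B after exchange:", b.belief())
```

With these made-up numbers, engine A on its own assigns a probability of about 0.31 to the shared component being faulty; after one exchange both engines agree on about 0.57. Real systems have many shared components and loopy dependency graphs, where belief propagation is iterative and generally approximate.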

Similar resources

Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm

Over the past two decades, PC speeds have increased from a few instructions per second to several million instructions per second. The tremendous speed of today's networks as well as the increasing need for high-performance systems has made researchers interested in parallel and distributed computing. The rapid growth of distributed systems has led to a variety of problems. Task allocation is a...

Crash Management for Distributed Parallel Systems

With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to coping with this problem is self-healing, one of organic computing’s self-x features. Self-healing in this context means that computer clusters should detect and handle failures automatically. This paper presents a self-healing mechanism based on checkpointing, so that a c...

Design of a Simulator for Large-Scale Distributed Shared-Memory Cache-Coherent Architectures

As the scale and the complexity of parallel computer systems grow rapidly, the study of interactions between application algorithms and parallel architectures becomes more important. Execution-driven simulation under realistic workloads proves to be an accurate and efficient technique for studying the performance of computer systems. However, direct-execution simulation of shared-memory cache-coh...

Self healing distributed systems

The growing complexity of distributed systems demands new ways of control. This work addresses self-healing in distributed environments. The term self-healing represents a fairly new area of research and is used quite broadly, but it can be seen as dynamic fault tolerance. This work proposes generic concepts and algorithms to build self-healing systems. The detection of node failures in...

The Self Distributing Virtual Machine (SDVM) - Making Computer Clusters Heal Themselves

With the rapidly growing capability of computer architectures, their complexity grows as well. More and more parallelism is necessary to provide the needed computing power. Moreover, systems must adapt to changing environments and cope with a breakdown of components. One approach is to incorporate organic features into computer systems. Organic computers [14] are characterized by self-x properti...

Publication date: 2005